CGN, an annotated corpus of spoken Dutch

نویسندگان

  • Ineke Schuurman
  • Machteld Schouppe
  • Heleen Hoekstra
  • Ton van der Wouden
چکیده

Although there are two variants of Dutch, the northern variant being the one used in the Netherlands and the southern variant in Flanders (Belgium), one corpus of spoken Dutch is under construction, the Spoken Dutch Corpus (CGN). In this paper first the principles of this corpus will be discussed, thereafter a few small case studies will show what the merits of such a corpus are.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Syntactic Annotation for the Spoken Dutch Corpus Project (CGN)

Of the ten million words of contemporary standard Dutch in the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN), a selection of one million words of natural spoken language will be annotated syntactically. In the present paper we discuss the tag sets and the annotation procedures that are currently being developed and tested. The annotation tags provide information about syntactic constit...

متن کامل

JASMIN-CGN: Extension of the Spoken Dutch Corpus with Speech of Elderly People, Children and Non-natives in the Human-Machine Interaction Modality

Large speech corpora (LSC) constitute an indispensable resource for conducting research in speech processing and for developing real-life speech applications. In 2004 the Spoken Dutch Corpus (CGN) became available, a corpus of standard Dutch as spoken by adult natives in the Netherlands and Flanders. Owing to budget constraints, CGN does not include speech of children, non-natives, elderly peop...

متن کامل

Syntactic Analysis in the Spoken Dutch Corpus (CGN)

The paper describes the syntactic annotation of the Spoken Dutch Corpus (“Corpus Gesproken Nederlands” or CGN), the Dutch-Flemish project (1998-2003) aiming at the collection, description and annotation of ten million words of spoken Dutch. In the first part, the background of the parsing strategy is discussed, as well as some details concerning the actual implementation of the parsing process....

متن کامل

Automatic Phonemic Labeling and Segmentation of Spoken Dutch

The CGN corpus (Oostdijk, 2000) (Corpus Gesproken Nederlands/Corpus Spoken Dutch) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This paper describes the automatic generation of th...

متن کامل

Example-Based Treebank Querying with GrETEL - now also for Spoken Dutch

Although several syntactically annotated corpora (or treebanks) exist for Dutch, they are seldomly used for descriptive linguistic research because there are no easy-to-use exploitation tools available. This demonstration paper describes GrETEL, a linguistic search engine (http:// nederbooms.ccl.kuleuven.be/eng/gretel) that enables non-technical users to consult treebanks in a user-friendly way...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003